DOCUMENT RESUME

ED 242 747                                        TM 840 155

AUTHOR          Kolen, Michael J.
TITLE           Standard Errors of the Tucker Method for Linear
                Equating under the Common Item Nonrandom Groups
                Design. ACT Technical Bulletin Number 44.
INSTITUTION     American Coll. Testing Program, Iowa City, Iowa.
REPORT NO       ACT-TB-44
PUB DATE        Jan 84
NOTE            35p.
PUB TYPE        Reports - Research/Technical (143)
EDRS PRICE      MF01/PC02 Plus Postage.
DESCRIPTORS     Certification; *Comparative Analysis; *Equated
                Scores; *Error of Measurement; Research Methodology;
                *Sampling; Simulation; Testing Programs
IDENTIFIERS     Efrons Bootstrap; Linear Equating Method;
                *Nonrandomized Design; *Tucker Common Item Equating
                Method



ABSTRACT

Large sample standard errors for the Tucker method of
linear equating under the common item nonrandom groups design are
derived under normality assumptions as well as under less restrictive
assumptions. Standard errors of Tucker equating are estimated using
the bootstrap method described by Efron. The results from different
methods are compared via a computer simulation as well as a real data
example based on test forms from a professional certification testing
program. (Author/PN)






ACT Technical Bulletin Number 44

Standard Errors of the Tucker Method
for Linear Equating Under the Common
Item Nonrandom Groups Design

Michael J. Kolen

The American College Testing Program
January 1984



A principal purpose of the ACT Technical Bulletin Series is to
provide timely reports of the results of measurement research
at ACT. Comments concerning technical bulletins are solicited
from ACT staff, ACT's clients, and the professional community
at large. A technical bulletin should not be quoted without
permission of the author(s). Each technical bulletin is automatically
superseded upon formal publication of its contents.



Research and Development Division

The American College Testing Program 

P.O. Box 188 

Iowa City, Iowa 52243 






Standard Errors of the Tucker Method for Linear Equating
Under the Common Item Nonrandom Groups Design

Michael J. Kolen
The American College Testing Program

The author thanks David Jarjoura and Robert L. Brennan for their comments
and suggestions.



TABLE OF CONTENTS

Tucker Common Item Equating with Nonrandom Groups

Large Sample Standard Errors

Computer Simulation

Bootstrap Standard Errors

Real Data Example

Discussion

References

Tables






LIST OF TABLES

Table

1  Partial Derivatives of Tucker Linear Equating Equation
   With Respect to Each Sample Statistic and Evaluated at
   the Parameters

2  Sampling Variances and Covariances of Bivariate Moments

3  Population Means, Standard Deviations, Skewness, and
   Kurtosis for Simulated Observed Score Distributions

4  Standard Errors of Tucker Equating for Two Simulated
   Tests and at Two Sample Sizes

5  Raw Score Summary Statistics for Forms X and Y and
   Common Items V for a Professional Certification Program

6  Standard Errors of Tucker Equating for a Professional
   Certification Program






ABSTRACT

Large sample standard errors are derived for the Tucker linear test score
equating method under the common item nonrandom groups design. Standard
errors are derived without the normality assumption that is commonly made in
the derivation of standard errors of linear equating. The behavior of the
standard errors is studied using a computer simulation and a real data
example.

Key Words: Equating, Standard Errors, Nonrandom Groups



Standard Errors of the Tucker Method for Linear Equating
Under the Common Item Nonrandom Groups Design

Test form equating of observed scores adjusts for small differences in
difficulty among multiple forms of a test for a specified population of
examinees. Such equating requires a design for collecting data and a method
for equating forms. The common item nonrandom groups design [Angoff, 1971,
pp. 579-583] is a design in which two groups of examinees from different
populations (nonrandom groups) are each administered different test forms that
have a subset of items in common. Linear methods under this design are
examined in the present paper.

Standard errors of equating are a means for expressing the amount of
error in test form equating that is due to examinee sampling. For a given
score on one form of a test, the error in estimating its equated score on
another form is often indexed by a standard error. These standard errors
generally differ by score level. Standard errors of equating are used as a
means for expressing equating error when scores are reported, in the
estimation of sample size required to achieve a given level of equating
precision, and as a basis for comparing equating methods and designs.

Large sample standard errors for the Tucker method of linear equating
under the common item nonrandom groups design are derived under normality
assumptions as well as under less restrictive assumptions in the present
paper. Also, standard errors of Tucker equating are estimated using the
bootstrap method described by Efron [1982]. The results from different
methods are compared via a computer simulation as well as a real data example
based on test forms from a professional certification testing program.



Tucker Common Item Equating with Nonrandom Groups



Multiple forms of a test to be equated are designed to be similar in
content and statistical characteristics. For the common item nonrandom groups
design, a new form is equated to an old (previously equated) form using a set
of items that are common to the two forms. The set of common items is
constructed to be similar to each of the full length forms in content balance
and in the statistical characteristics of its items. Scores on the common
items may contribute to the total score on each form (an internal set of
common items) or they may not contribute to the total score on each form (an
external set of common items).

The new and old forms are administered to examinees from different
populations under this design. In order to accomplish observed score
equating, a decision must be made on how to combine these two populations.
The combined population, which has been referred to as the synthetic
population by Braun and Holland [1982], is a weighted combination of the two
populations from which data are gathered.

Refer to the new test form as X, the old form as Y, and the set of common
items as V. Examinees from Population 1 are administered X and V. Examinees
from Population 2 are administered Y and V. Consider that these two
populations are weighted using proportional weights $w_1$ and $w_2$ (where
$w_1 + w_2 = 1$ and $w_1, w_2 \ge 0$) to form the synthetic population. The
general linear equation for equating scores on X to the scale of Y is:




$$\ell(x) = \sigma_s(Y)\left[\frac{x - \mu_s(X)}{\sigma_s(X)}\right] + \mu_s(Y) . \tag{1}$$

In this equation $\mu_s(X)$, $\mu_s(Y)$, $\sigma_s(X)$, and $\sigma_s(Y)$ are the means and standard
deviations of scores on X and Y for the synthetic population, and $\ell(x)$ is
the value of the linear equating function at x.
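As a small illustration of the general linear equating function, the sketch below evaluates it for given synthetic-population moments: a score is converted to a standardized deviation on Form X and then placed at the same standardized deviation on Form Y. The function name `linear_equate` and the numeric values in the usage lines are hypothetical and are not taken from this bulletin.

```python
def linear_equate(x, mu_x, sd_x, mu_y, sd_y):
    """Map score x on Form X to the Form Y scale by matching standardized scores."""
    if sd_x <= 0:
        raise ValueError("sd_x must be positive")
    z = (x - mu_x) / sd_x      # standardized deviation of x on Form X
    return sd_y * z + mu_y     # same standardized deviation on the Form Y scale


# Hypothetical moments, chosen only to show the mechanics.
print(linear_equate(50.0, mu_x=50.0, sd_x=10.0, mu_y=60.0, sd_y=5.0))
print(linear_equate(60.0, mu_x=50.0, sd_x=10.0, mu_y=60.0, sd_y=5.0))
```

A score at the Form X mean maps to the Form Y mean, and a score one Form X standard deviation above the mean maps to one Form Y standard deviation above the Form Y mean.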

The parameters in (1) depend on parameters in Populations 1 and 2. From
equating administrations we can obtain estimates of the following for
Population 1:

$\mu_1(X)$ = mean for X ,
$\sigma_1(X)$ = standard deviation for X ,
$\mu_1(V)$ = mean for V ,
$\sigma_1(V)$ = standard deviation for V , and
$\sigma_1(X,V)$ = covariance between X and V ,

and for Population 2:

$\mu_2(Y)$ = mean for Y ,
$\sigma_2(Y)$ = standard deviation for Y ,
$\mu_2(V)$ = mean for V ,
$\sigma_2(V)$ = standard deviation for V , and
$\sigma_2(Y,V)$ = covariance between Y and V .

Note that from the equating study we are unable to obtain estimates of the
following for Population 1:

$\mu_1(Y)$ = mean for Y ,
$\sigma_1(Y)$ = standard deviation for Y , and
$\sigma_1(Y,V)$ = covariance between Y and V ,

and for Population 2:

$\mu_2(X)$ = mean for X ,
$\sigma_2(X)$ = standard deviation for X , and
$\sigma_2(X,V)$ = covariance between X and V .

This is so because Y is not administered to examinees from Population 1 and X
is not administered to examinees from Population 2.

The assumptions used to arrive at expressions for these parameters
distinguish the Tucker method from other linear methods for common item
equating under the nonrandom groups design. The Tucker method requires that
the linear regression of X on V be identical for Populations 1 and 2. A
similar assumption is required for the regression of Y on V. Let $\alpha$ represent
a regression slope so that, for example, $\alpha_1(X|V) = \sigma_1(X,V)/\sigma_1^2(V)$ is the
slope for the linear regression of X on V for Population 1. Let $\beta$ represent a
regression intercept so that, for example, $\beta_1(X|V) = \mu_1(X) - \alpha_1(X|V)\mu_1(V)$ is
the intercept for the linear regression of X on V for Population 1. The
Tucker method requires that

$$\alpha_1(X|V) = \alpha_2(X|V) ,$$
$$\alpha_1(Y|V) = \alpha_2(Y|V) ,$$
$$\beta_1(X|V) = \beta_2(X|V) , \text{ and}$$
$$\beta_1(Y|V) = \beta_2(Y|V) .$$



In Tucker equating, it is also assumed that

$$\sigma_1^2(X)[1 - \rho_1^2(X,V)] = \sigma_2^2(X)[1 - \rho_2^2(X,V)] \text{ and}$$
$$\sigma_1^2(Y)[1 - \rho_1^2(Y,V)] = \sigma_2^2(Y)[1 - \rho_2^2(Y,V)] ,$$

where $\rho^2$ refers to a squared correlation, so that, for example, $\rho_1^2(X,V) =
\sigma_1^2(X,V)/[\sigma_1^2(X)\sigma_1^2(V)]$. This is sometimes referred to as the assumption that
the variance errors of linearly estimating X from V as well as Y from V are
the same for the two populations. Sometimes stronger assumptions are used for
deriving these equations, such as those used by Braun and Holland [1982], but
the assumptions listed in this paper are sufficient.

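Given these assumptions, it can be shown that for Population 1,

$$\mu_1(Y) = \mu_2(Y) + \alpha_2(Y|V)[\mu_1(V) - \mu_2(V)] , \tag{2}$$
$$\sigma_1^2(Y) = \sigma_2^2(Y) + \alpha_2^2(Y|V)[\sigma_1^2(V) - \sigma_2^2(V)] , \text{ and} \tag{3}$$
$$\sigma_1(Y,V) = \sigma_2(Y,V)\,\frac{\sigma_1^2(V)}{\sigma_2^2(V)} . \tag{4}$$

And, for Population 2,

$$\mu_2(X) = \mu_1(X) - \alpha_1(X|V)[\mu_1(V) - \mu_2(V)] , \tag{5}$$
$$\sigma_2^2(X) = \sigma_1^2(X) - \alpha_1^2(X|V)[\sigma_1^2(V) - \sigma_2^2(V)] , \text{ and} \tag{6}$$
$$\sigma_2(X,V) = \sigma_1(X,V)\,\frac{\sigma_2^2(V)}{\sigma_1^2(V)} . \tag{7}$$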
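The Population 1 expressions for the unobserved Form Y moments can be checked numerically. The sketch below computes the Population 1 mean and variance of Y and the covariance of Y with V from Population 2 quantities and the two sets of V moments, under the Tucker assumptions. The function name, argument names, and all numeric values are hypothetical illustrations, not figures from this bulletin.

```python
def impute_population1_y(mu2_y, var2_y, cov2_yv, mu1_v, var1_v, mu2_v, var2_v):
    """Population 1 moments of Y implied by the Tucker assumptions."""
    a2 = cov2_yv / var2_v                          # slope of Y on V, Population 2
    mu1_y = mu2_y + a2 * (mu1_v - mu2_v)           # mean of Y, Population 1
    var1_y = var2_y + a2**2 * (var1_v - var2_v)    # variance of Y, Population 1
    cov1_yv = cov2_yv * var1_v / var2_v            # covariance of Y and V, Population 1
    return mu1_y, var1_y, cov1_yv


# Hypothetical inputs.
mu1_y, var1_y, cov1_yv = impute_population1_y(
    mu2_y=90.0, var2_y=100.0, cov2_yv=20.0,
    mu1_v=24.0, var1_v=20.0, mu2_v=22.0, var2_v=16.0,
)
slope1 = cov1_yv / 20.0                     # implied slope in Population 1
resid1 = var1_y - cov1_yv**2 / 20.0         # implied residual variance, Population 1
resid2 = 100.0 - 20.0**2 / 16.0             # residual variance, Population 2
print(mu1_y, var1_y, cov1_yv, slope1, resid1, resid2)
```

With these inputs the implied Population 1 slope equals the Population 2 slope and the two residual variances agree, which is what the regression and variance-error assumptions require.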



In order to arrive at the Tucker equating equation, it is necessary to
obtain expressions for the means and variances of X and Y for the synthetic
population. These parameters are expressible in terms of parameters for
Populations 1 and 2 as follows:

$$\mu_s(X) = w_1\mu_1(X) + w_2\mu_2(X) ,$$
$$\mu_s(Y) = w_1\mu_1(Y) + w_2\mu_2(Y) ,$$
$$\sigma_s^2(X) = w_1\sigma_1^2(X) + w_2\sigma_2^2(X) + w_1 w_2[\mu_1(X) - \mu_2(X)]^2 , \text{ and}$$
$$\sigma_s^2(Y) = w_1\sigma_1^2(Y) + w_2\sigma_2^2(Y) + w_1 w_2[\mu_1(Y) - \mu_2(Y)]^2 .$$

Substituting (2) through (7) in the above equations gives:

$$\mu_s(X) = \mu_1(X) - w_2\alpha_1(X|V)[\mu_1(V) - \mu_2(V)] , \tag{8}$$
$$\mu_s(Y) = \mu_2(Y) + w_1\alpha_2(Y|V)[\mu_1(V) - \mu_2(V)] , \tag{9}$$
$$\sigma_s^2(X) = \sigma_1^2(X) - w_2\alpha_1^2(X|V)[\sigma_1^2(V) - \sigma_2^2(V)] + w_1 w_2\alpha_1^2(X|V)[\mu_1(V) - \mu_2(V)]^2 , \text{ and} \tag{10}$$
$$\sigma_s^2(Y) = \sigma_2^2(Y) + w_1\alpha_2^2(Y|V)[\sigma_1^2(V) - \sigma_2^2(V)] + w_1 w_2\alpha_2^2(Y|V)[\mu_1(V) - \mu_2(V)]^2 , \tag{11}$$

where all parameters to the right of equal signs in (8) through (11) can be
estimated directly using data from the study design. Equations (8) through
(11) can be entered into (1) to produce the Tucker linear equating function.
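A minimal sketch of how the synthetic-population moments and the linear equating function fit together is given below. It estimates the five observable moments in each group, forms the synthetic means and variances, and returns the Form Y equivalent of a Form X score. The function names, the use of divisor n for sample variances and covariances, and the small data set are assumptions made for illustration only.

```python
from statistics import fmean


def moments(pairs):
    """Means, variances, and covariance of (total, common) score pairs (divisor n)."""
    n = len(pairs)
    mt = fmean(t for t, _ in pairs)
    mv = fmean(v for _, v in pairs)
    vt = sum((t - mt) ** 2 for t, _ in pairs) / n
    vv = sum((v - mv) ** 2 for _, v in pairs) / n
    ctv = sum((t - mt) * (v - mv) for t, v in pairs) / n
    return mt, mv, vt, vv, ctv


def tucker_equate(x, xv_pairs, yv_pairs, w1=0.5):
    """Tucker Form Y equivalent of Form X score x."""
    w2 = 1.0 - w1
    mx, mv1, vx, vv1, cxv = moments(xv_pairs)   # Population 1: X and V
    my, mv2, vy, vv2, cyv = moments(yv_pairs)   # Population 2: Y and V
    a1 = cxv / vv1                              # slope of X on V
    a2 = cyv / vv2                              # slope of Y on V
    dmu, dvar = mv1 - mv2, vv1 - vv2
    mu_sx = mx - w2 * a1 * dmu                                  # synthetic mean of X
    mu_sy = my + w1 * a2 * dmu                                  # synthetic mean of Y
    var_sx = vx - w2 * a1**2 * dvar + w1 * w2 * a1**2 * dmu**2  # synthetic variance of X
    var_sy = vy + w1 * a2**2 * dvar + w1 * w2 * a2**2 * dmu**2  # synthetic variance of Y
    return (var_sy / var_sx) ** 0.5 * (x - mu_sx) + mu_sy


# Hypothetical (total, common) score pairs.
group1 = [(10, 3), (12, 4), (15, 6), (11, 3), (14, 5)]
group2 = [(t + 2, v) for t, v in group1]    # same examinees, form 2 points easier
print(tucker_equate(12, group1, group2))
```

In this contrived case the two groups have identical common-item scores and Form Y totals are uniformly 2 points higher, so the equating function reduces to adding 2 to the Form X score.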

Also, the mean and variance of V for the synthetic population can be
expressed, respectively, as:

$$\mu_s(V) = w_1\mu_1(V) + w_2\mu_2(V) \text{ and} \tag{12}$$
$$\sigma_s^2(V) = w_1\sigma_1^2(V) + w_2\sigma_2^2(V) + w_1 w_2[\mu_1(V) - \mu_2(V)]^2 . \tag{13}$$

It can be shown that the combination of (8) through (11) and (12) and
(13) will produce counterparts of the Tucker method equation described by
Angoff [1971, p. 580], if weights are chosen proportional to sample size, that
is, $w_1 = n_1/(n_1 + n_2)$ and $w_2 = n_2/(n_1 + n_2)$, where $n_1$ and $n_2$ are the sample
sizes of examinees included in the equating study from Populations 1 and 2,
respectively. Gulliksen [1950, pp. 299-301] presents a version of the Tucker
method that differs from Angoff's version. The present equations will result
in counterparts of Gulliksen's if we set $w_1 = 1$ and $w_2 = 0$.

Large Sample Standard Errors

Kendall and Stuart [1977, pp. 246-247] present a general method for
approximating standard errors which is based on the Taylor expansion. This
method is often referred to as the delta method. Lord [1950] presents
standard errors of linear equating derived using the delta method under a
variety of data collection designs, and many of these standard errors are
reported by Angoff [1971]. However, standard errors of Tucker equating are
not presented in any of these sources. (The standard errors presented by
Angoff [1971, p. 577] were derived by Lord [1950] for common item equating
with random groups, under the assumption that $\mu_1(V) = \mu_2(V)$ and
$\sigma_1(V) = \sigma_2(V)$. Thus, they are inappropriate for the nonrandom groups
situation.)

In applying the delta method to standard errors of linear equating, Lord
[1950] made what we will refer to here as the normality assumption. For
equating designs that require consideration of bivariate distributions, the
normality assumption is that all of the central moments through order 4 of the
score distributions are identical to those of a bivariate normal distribution,
and for equating designs that require consideration of only univariate
distributions, the normality assumption is that the central moments through
order 4 of the score distributions are identical to those of a univariate
normal distribution.

Recently, Braun and Holland [1982, pp. 32-35] derived standard errors
using the delta method without making such a restrictive assumption for the
situation in which randomly equivalent groups of examinees are administered
the forms to be equated. Their resulting standard error expressions suggest
that standard errors of equating based on the normality assumption may produce
misleading results when score distributions are skewed or more peaked than a
normal distribution. Because skewed distributions are typical of many testing
programs, we derive standard errors of Tucker equating without the normality
assumption in the present paper. We also derive standard errors with the
normality assumption for comparison purposes.

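Let $\theta_1, \theta_2, \ldots, \theta_{10}$ be used as an alternate representation of
$\mu_1(X)$, $\mu_1(V)$, $\sigma_1^2(X)$, $\sigma_1^2(V)$, $\sigma_1(X,V)$, $\mu_2(Y)$, $\mu_2(V)$, $\sigma_2^2(Y)$, $\sigma_2^2(V)$, and $\sigma_2(Y,V)$,
respectively, and let $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_{10}$ represent their estimates. For
example, $\theta_1$ is an alternate representation of $\mu_1(X)$. Let $\hat\ell(x)$
represent the estimated Tucker linear equating function arrived at by
substituting estimates of parameters into (8) through (11) and substituting
these into (1). Let $\ell_i'$ represent $\partial\ell/\partial\theta_i$ (the partial derivative of $\ell$ with
respect to $\theta_i$) evaluated at $\theta_1, \theta_2, \ldots, \theta_{10}$. Then by the delta method
described by Kendall and Stuart [1977, pp. 246-247],

$$\mathrm{var}[\hat\ell(x)] = \sum_{i=1}^{10} \ell_i'^{\,2}\,\mathrm{var}(\hat\theta_i) + \mathop{\sum\sum}_{i \ne j = 1}^{10} \ell_i'\ell_j'\,\mathrm{cov}(\hat\theta_i, \hat\theta_j) .$$

Because samples are independently drawn from Populations 1 and 2, the sampling
covariances between each of the first five $\hat\theta_i$'s and each of the last five
are zero. Thus,

$$\mathrm{var}[\hat\ell(x)] = \sum_{i=1}^{10} \ell_i'^{\,2}\,\mathrm{var}(\hat\theta_i) + \mathop{\sum\sum}_{i \ne j = 1}^{5} \ell_i'\ell_j'\,\mathrm{cov}(\hat\theta_i, \hat\theta_j) + \mathop{\sum\sum}_{i \ne j = 6}^{10} \ell_i'\ell_j'\,\mathrm{cov}(\hat\theta_i, \hat\theta_j) . \tag{14}$$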

The partial derivatives ($\ell_i'$'s) necessary for (14) are shown in Table 1.
For this table, $z_x = [x - \mu_s(X)]/\sigma_s(X)$. All other notation has been
defined previously. The sampling variances and covariances for (14) can be
obtained from Table 2. In this table, n refers to sample size. (Note that
the variables X and Y in Table 2 are general.) By substituting the partial
derivatives from Table 1 and the sampling variances and covariances from the
"general" column in Table 2 into (14), we obtain the equation for the variance
error of Tucker equating. (Use the "normal" column of Table 2 for the
variance error under the normality assumption.) Note that the standard error
for Tucker equating is $\mathrm{se}[\hat\ell(x)] = \{\mathrm{var}[\hat\ell(x)]\}^{1/2}$.
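The delta method computation has a simple generic shape: a gradient of the function with respect to the parameters, combined with the sampling covariance matrix of the estimates. The sketch below illustrates that shape only. It uses central-difference numerical derivatives in place of the analytic derivatives of Table 1, and the covariance matrix must be supplied by the caller, so it is not a reproduction of the paper's Tables 1 and 2. All names and numbers are hypothetical.

```python
def numerical_gradient(f, theta, h=1e-6):
    """Central-difference approximation to the partial derivatives of f at theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        dn = list(theta)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad


def delta_method_variance(f, theta, cov):
    """First-order variance of f(theta_hat); cov is the sampling covariance matrix."""
    g = numerical_gradient(f, theta)
    k = len(theta)
    # The i == j terms give the variance sum; the i != j terms give the covariance sum.
    return sum(g[i] * g[j] * cov[i][j] for i in range(k) for j in range(k))


# Hypothetical example: f is linear, so the first-order result is exact.
f = lambda t: 2.0 * t[0] + 3.0 * t[1]
cov = [[1.0, 0.5],
       [0.5, 2.0]]
variance = delta_method_variance(f, [0.0, 0.0], cov)
print(variance, variance ** 0.5)    # variance error and standard error
```

For the Tucker case, `f` would map the ten moments to the equated score at a fixed x, and `cov` would be block diagonal because the two samples are independent.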






Insert Tables 1 and 2 About Here 



As an example of how to proceed, refer to the first term in the first
summation in (14), which is $\ell_1'^{\,2}\,\mathrm{var}(\hat\theta_1)$. From Table 1, $\ell_1'^{\,2}$ is $\sigma_s^2(Y)/\sigma_s^2(X)$,
and from Table 2, $\mathrm{var}(\hat\theta_1) = \mathrm{var}[\hat\mu_1(X)] = \sigma_1^2(X)/n_1$. Note that this term is
the same under the general or the normality assumption. As another example,
refer to the second term in the second summation of (14), which is
$\ell_1'\ell_3'\,\mathrm{cov}(\hat\theta_1, \hat\theta_3)$. From Table 1, $\ell_1'\ell_3' = [-\sigma_s(Y)/\sigma_s(X)]\{-z_x\sigma_s(Y)/[2\sigma_s^2(X)]\}$,
and from Table 2, $\mathrm{cov}(\hat\theta_1, \hat\theta_3) = \mathrm{cov}[\hat\mu_1(X), \hat\sigma_1^2(X)] = E[X - \mu_1(X)]^3/n_1$ under
general conditions. From the table, this term would be zero under the
normality assumption.

The standard errors will not be written here in more detail than (14)
because their full form is too cumbersome. However, the standard errors are
easily programmed via computer.

Clearly, the standard error expression is complicated. For this reason,
it is difficult to make general interpretative statements. One such
observation, however, is that if the sample sizes for the two groups are
equal, then there is a simple relationship between the variance error and
sample size: the magnitude of the variance error is inversely
proportional to sample size. For example, a doubling of the sample size will
lead to a halving of the variance error.

In practice, when standard errors of Tucker equating are estimated,
parameter estimates must be used to calculate the derivatives shown in Table 1
and the sampling variances and covariances shown in Table 2. Under the
normality assumption, we need to estimate means, variances, and covariances to
obtain the sampling variances and covariances in Table 2. However, under
nonnormality, we also need to estimate skewness, kurtosis, and several higher
order cross-product moments.

Computer Simulation 

A computer simulation was conducted to study the behavior of the
estimated standard errors. Score distributions were simulated to reflect the
score distributions of test forms from two different testing programs. The
distributions for two test forms model those in a particular professional
certification program. (Real data for real forms of a test in this program are
used in a subsequent illustration.) These distributions are negatively
skewed, and the simulation based on these distributions is referred to as the
nonsymmetric simulation.

Distributions for two forms of a second test are modeled after the mean,
standard deviation, skewness, and kurtosis found in an achievement testing
program. The simulation based on these distributions is referred to as the
nearly symmetric simulation. The distributions in the nearly symmetric
simulation are flatter than a normal distribution. (Lord [1955] surveyed
distributions for a variety of tests and found that symmetric test score
distributions tend to be flatter than the normal distribution, and he
references theoretical discussions of this issue.)

For purposes of the simulation, we assume that true scores (on the
proportion-correct scale) are distributed as a two-parameter beta
distribution, and that given a particular true score, the observed score
distribution can be described by the binomial. The resulting distribution of
observed scores under these conditions is the negative hypergeometric [Lord &
Novick, 1968, pp. 515-521].

For the nonsymmetric simulation, the beta true score distributions of X
and V for Population 1 were assigned parameters a = 10.5 and b = 3.0. And,
for Population 2 the beta true score distributions of Y and V were assigned
parameters a = 9.5 and b = 3.0. The numbers of items contained on these
simulated test forms are 125 for X and Y and 30 for V.

For the nearly symmetric simulation, the beta true score distributions of
X and V for Population 1 were assigned parameters a = 6.G and b = 6.2. And,
for Y and V for Population 2 the parameters assigned were a = 5.4 and
b = 5.2. The numbers of items contained on these simulated tests are 52 for X
and Y and 15 for V.

Population means, standard deviations, skewness indices, and kurtosis
indices of observed scores are shown in Table 3 for the simulated test
forms. The nonsymmetric distributions are relatively easy (over 75% of the
items answered correctly, on average), negatively skewed, and have a kurtosis
index higher than that for a normal distribution, indicating more
peakedness. The nearly symmetric distributions have means near 50% of the
items answered correctly, are nearly symmetric, and are less peaked than a
normal distribution.
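The mean and variance of an observed score that is binomial given a beta-distributed true score follow from standard beta-binomial results, and a short calculation shows why the nonsymmetric forms are described as relatively easy. The sketch below uses the well-known beta-binomial moment formulas; the function name is hypothetical, and the printed values are computed here rather than copied from Table 3.

```python
def beta_binomial_mean_var(k, a, b):
    """Mean and variance of a k-item score with beta(a, b) true scores."""
    mean = k * a / (a + b)
    var = k * a * b * (a + b + k) / ((a + b) ** 2 * (a + b + 1))
    return mean, var


# Form X in the nonsymmetric simulation: 125 items, a = 10.5, b = 3.0.
mean_x, var_x = beta_binomial_mean_var(125, 10.5, 3.0)
print(mean_x, mean_x / 125, var_x ** 0.5)
```

The proportion correct implied by these parameters is a/(a + b) = 10.5/13.5, or about 0.78, which agrees with the statement that over 75% of the items are answered correctly on average.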



Insert Table 3 about here 






For the simulation, let $k_x$ represent the number of items on X, $k_y$ the
number of items on Y, and $k_v$ the number of items on V. Also, for the
simulation $k_x = k_y$. Define $k_g = k_x - k_v$. Because we are simulating an
internal set of common items, $k_g$ represents the number of items in X and Y
that are not common, and $k_v$ represents the number of common items.

Consider the nonsymmetric simulation for a sample size of 100 examinees
per test form with the previously defined beta parameters. The following
steps were used to simulate pairs of X and V scores:

(i) Randomly generate a beta variate from the two-parameter beta
distribution for X and V in Population 1. This beta variate, which is
referred to as p, represents a proportion-correct true score. (IMSL,
1982, subroutine GGBTR was used to generate the true score.)

(ii) Randomly generate a variate from a binomial distribution with
parameter p based on $k_v$ trials. This variate represents observed
score on V. (IMSL, 1982, subroutine GGBN was used to generate binomial
variates.)

(iii) Randomly generate a variate from a binomial distribution with
parameter p based on $k_g$ trials. This variate represents observed
score on the non-common items.

(iv) Add together the binomial variates from steps ii and iii. This sum
represents observed score on the total test form, X.

(v) Repeat steps i through iv n times, where n represents the sample size
used in the simulation. This results in a set of n pairs of observed
scores for X and V.
Next, by substituting Y for X and Population 2 for Population 1 in the
above steps, n pairs of observed scores were generated for Population 2 using
the appropriate beta parameters. At this point, we have n pairs of scores on
X and V for Population 1 and n pairs of scores on Y and V for Population 2.
Based on these simulated data, a Form Y equivalent of each Form X integer
score was obtained using Tucker equating with $w_1 = w_2 = 0.5$. Also, standard
errors of equating were estimated for each X (integer) score based on the
delta method with the normality assumption as well as the delta method without
the normality assumption. This whole process was replicated 500 times.
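The data-generation steps described above can be sketched with the Python standard library. This is an illustration only: Python's `random` module stands in for the IMSL subroutines GGBTR and GGBN used in the study, binomial variates are formed as sums of Bernoulli trials, and the function name and seed are hypothetical.

```python
import random


def simulate_pairs(n, a, b, k_total, k_common, rng):
    """Simulate n (total score, common-item score) pairs with an internal anchor."""
    k_noncommon = k_total - k_common
    pairs = []
    for _ in range(n):
        p = rng.betavariate(a, b)                               # true proportion correct
        v = sum(rng.random() < p for _ in range(k_common))      # score on common items
        u = sum(rng.random() < p for _ in range(k_noncommon))   # score on non-common items
        pairs.append((u + v, v))                                # total score includes V
    return pairs


rng = random.Random(1984)
# Nonsymmetric simulation, 100 examinees per form, 125 items with 30 common items.
pop1 = simulate_pairs(100, 10.5, 3.0, 125, 30, rng)   # (X, V) pairs, Population 1
pop2 = simulate_pairs(100, 9.5, 3.0, 125, 30, rng)    # (Y, V) pairs, Population 2
print(pop1[:3], pop2[:3])
```

One replication of the study would equate each integer Form X score using these two samples; repeating the whole procedure 500 times gives the distribution of Form Y equivalents at each score.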

The "true" standard error of equating for a given integer score on X is
defined here as the standard deviation of Form Y equivalents of that score
over the 500 replications. The nonnormal delta method standard error
associated with each X (integer) score is the mean delta method standard error
derived without the normality assumption over the 500 replications. The
normal delta method standard error is defined similarly.

Nonsymmetric and symmetric simulations were each conducted using sample
sizes of 100 and 250 simulated examinees per form. The "true", nonnormal, and
normal standard errors at selected score points are shown in Table 4. Also
shown are root mean squared errors (RMSE) in estimating the standard errors.
As an example of how to interpret Table 4, consider the top row. The data in
this row are for the nonsymmetric simulation with sample size of 250 examinees
per test form, as indicated in the table. This top row gives standard errors
for estimating Tucker equivalents on Form Y of a score of 120 on Form X. The
"true" standard error is 1.01, the nonnormal standard error 0.96, and normal
standard error 1.12. Root mean squared errors in estimating the nonnormal and
normal standard error also are shown.



Insert Table 4 about here



For both nonsymmetric simulations, the normal standard errors tend to be
different in pattern from the "true" standard errors. The nonnormal standard
errors are similar in pattern to the "true" standard errors. However, at a
sample size of 100 per form the nonnormal standard errors are uniformly too
small. At 250 examinees per form the nonnormal standard errors are similar to
the "true" standard errors.

For a sample size of 100 per form in the nearly symmetric simulation,
both the nonnormal and normal standard errors are not too dissimilar from the
"true" standard errors. The nonnormal standard errors tend to be too small
while the normal standard errors tend to be too large. For a sample size of
250, the nonnormal standard errors are nearly identical to the "true" standard
errors, whereas the normal standard errors are too large.

Root mean squared errors in estimating the delta method standard errors
also are shown in Table 4. To calculate RMSE we find the variance of the
estimated standard errors over the 500 replications and add to it the squared
difference between the "true" standard error and delta method standard
error. The square root of this sum is the RMSE. The RMSE is a measure of the
variability in estimating standard errors. Smaller values of RMSE are
indicative of more accurate estimation.
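The RMSE calculation described above can be written in a few lines. The sketch assumes that the delta method standard error entering the squared difference is the mean of the estimated standard errors over replications, and that the variance uses divisor equal to the number of replications; both are assumptions, as are the function name and the numbers.

```python
from statistics import fmean, pvariance


def rmse_of_se(estimated_ses, true_se):
    """Root mean squared error of estimated standard errors around the true value."""
    bias = fmean(estimated_ses) - true_se           # mean estimate minus "true" value
    return (pvariance(estimated_ses) + bias ** 2) ** 0.5


# Hypothetical estimates from three replications and a hypothetical "true" value.
print(rmse_of_se([0.9, 1.0, 1.1], true_se=1.2))
```

With these conventions the result equals the square root of the average squared difference between each estimate and the "true" standard error, so it reflects both the spread of the estimates and their systematic departure from the target.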

Recall that the estimation of the normal standard errors requires
estimation of only means, variances, and covariances, whereas the estimation
of the nonnormal standard errors requires the estimation of these parameters
as well as higher order central moments and cross-product moments. Because
higher order moments and cross-product moments may be difficult to estimate
precisely due to sampling variability, the nonnormal standard errors may be
more difficult to estimate than the normal ones. However, for all but the
nearly symmetric simulation with sample size of 100 in Table 4, the RMSE is
smaller for the nonnormal standard errors than for the normal standard errors.

The results of the simulation indicate that for both tests simulated, the
nonnormal standard errors are more accurate than those based on normality
assumptions when sample size is 250 examinees per form.



Bootstrap Standard Errors

Even though the simulation provides evidence of the behavior of the
standard errors, a study of the delta method standard errors of equating using
actual test data seems desirable. Efron [1982] describes an alternative to
the delta method which he refers to as the bootstrap, and he presents a
variety of examples in which the bootstrap resulted in standard errors that
were more accurate for small sample situations than those based on the delta
method.

The computation of bootstrap standard errors requires extensive
resampling from the sample data. Thus a high-speed computer is essential.
Generally, to compute bootstrap standard errors, a random sample is drawn with
replacement from the sample data at hand, the statistic of interest is
calculated, and this process is repeated a large number of times. The
bootstrap standard error is the standard deviation of the computed values of
the statistic over repetitions of the process. The following steps are used
to bootstrap standard errors of Tucker equating.

(i) Begin with the $n_1$ examinees from Population 1 with scores on X and V
and the $n_2$ examinees from Population 2 with scores on Y and V.

(ii) Draw a random sample with replacement of size $n_1$ examinees, from the
sample data of the $n_1$ examinees from Population 1. The sampling
involves drawing pairs of X and V scores. Since the sampling is with
replacement, a particular examinee's score pair easily could be chosen
more than once.

(iii) Draw a random sample with replacement of size $n_2$ examinees, from the
sample data of the $n_2$ examinees from Population 2.

(iv) Estimate the Tucker equivalent at x using the data from the random
samples drawn in steps ii and iii, and refer to this estimate
as $\hat\ell_1(x)$.

(v) Repeat steps ii through iv B times obtaining bootstrap estimates
$\hat\ell_1(x), \hat\ell_2(x), \ldots, \hat\ell_B(x)$. Approximate the standard error by:

$$\mathrm{se}_{\mathrm{boot}}[\hat\ell(x)] = \left\{ \sum_{b=1}^{B} [\hat\ell_b(x) - \hat\ell_\cdot(x)]^2 / (B - 1) \right\}^{1/2} , \tag{15}$$

where $\hat\ell_\cdot(x) = \sum_{b=1}^{B} \hat\ell_b(x)/B$ .

These procedures can be applied at any x.
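The bootstrap steps above can be sketched as follows. Each group's score pairs are resampled with replacement, the Tucker equivalent is recomputed, and the standard deviation of the B estimates (divisor B - 1) is returned. The function names, the use of divisor n for sample moments, the default B, and the generated data are all assumptions made for illustration and do not come from this bulletin.

```python
import random
from statistics import fmean, stdev


def tucker_equate(x, xv_pairs, yv_pairs, w1=0.5):
    """Tucker Form Y equivalent of x from (X, V) and (Y, V) score pairs."""
    def moments(pairs):
        n = len(pairs)
        mt, mv = fmean(t for t, _ in pairs), fmean(v for _, v in pairs)
        vt = sum((t - mt) ** 2 for t, _ in pairs) / n
        vv = sum((v - mv) ** 2 for _, v in pairs) / n
        ctv = sum((t - mt) * (v - mv) for t, v in pairs) / n
        return mt, mv, vt, vv, ctv

    w2 = 1.0 - w1
    mx, mv1, vx, vv1, cxv = moments(xv_pairs)
    my, mv2, vy, vv2, cyv = moments(yv_pairs)
    a1, a2 = cxv / vv1, cyv / vv2                 # regression slopes on V
    dmu, dvar = mv1 - mv2, vv1 - vv2
    mu_sx = mx - w2 * a1 * dmu
    mu_sy = my + w1 * a2 * dmu
    var_sx = vx - w2 * a1**2 * dvar + w1 * w2 * a1**2 * dmu**2
    var_sy = vy + w1 * a2**2 * dvar + w1 * w2 * a2**2 * dmu**2
    return (var_sy / var_sx) ** 0.5 * (x - mu_sx) + mu_sy


def bootstrap_se(x, xv_pairs, yv_pairs, B=200, seed=0):
    """Bootstrap standard error of the Tucker equivalent at score x."""
    rng = random.Random(seed)
    n1, n2 = len(xv_pairs), len(yv_pairs)
    estimates = []
    for _ in range(B):
        s1 = [rng.choice(xv_pairs) for _ in range(n1)]   # resample (X, V) pairs
        s2 = [rng.choice(yv_pairs) for _ in range(n2)]   # resample (Y, V) pairs
        estimates.append(tucker_equate(x, s1, s2))
    return stdev(estimates)                              # divisor B - 1


# Hypothetical data: 60 examinees per group.
gen = random.Random(1)
group1 = [(3 * v + gen.randint(0, 20), v) for v in (gen.randint(10, 30) for _ in range(60))]
group2 = [(3 * v + gen.randint(0, 20), v) for v in (gen.randint(10, 30) for _ in range(60))]
print(bootstrap_se(80, group1, group2, B=200, seed=2))
```

Resampling whole pairs, rather than X and V separately, preserves the covariance between total and common-item scores within each bootstrap sample.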

Real Data Example

Data from forms X and Y of a 125 item multiple choice professional
certification testing program are used in this example. Form X was
administered to 773 examinees from Population 1 and Form Y to 795 examinees
from Population 2, and the forms were administered one year apart. The two
forms contain a common set of 30 items, referred to as V. Summary statistics
are shown in Table 5. The means suggest that the forms and common items were
fairly easy for these examinees. The average examinee correctly answered
approximately 77% of the items. According to the skewness indices, the score
distributions are markedly skewed, and the kurtosis indices indicate that the
distributions are more peaked than a normal distribution.



Insert Table 5 about here 



Results from Tucker equating with $w_1 = w_2 = 0.5$ and standard errors of
equating are shown in Table 6. Consider a Form X raw score of 100 in the
first column of the table. Reading across, this score has a percentile rank
of 54.7 and a Form Y equivalent of 102.7. The standard error of this
equivalent is 0.33 under normality assumptions, 0.29 without these assumptions, and
0.28 using the bootstrap. A one standard error band for the Form Y
equivalent of a Form X score of 100 is 102.7 ± 0.29 or approximately (102.4,
103.0) for the delta method standard errors derived without the normality
assumption.



Insert Table 6 about here



Generally, Che standard errors are smallest near the average score and 
become larger as we move awa/ from the average score, the standard errors 
under the normality assumption aire slightly larger at the higher scores and 
are smaller at Che lower scores than those derived without the normality 
assumption and those calculated by the bootstrap. Standard erro.s for the 
bootstrap snd the delta method without the normality assumption are nearly 
identical. 

For this testing program the passing score is usually close to a raw
score of 80, so equating error is crucial in this region. From Table 6, at
a raw score of 80, the delta method standard error of equating is .44 under
the normality assumption and .54 without such an assumption. The error
variances are, respectively, .19 (.44²) and .29 (.54²). Thus at a score of
80, the error variance under normality is only 66% [100(.19)/.29] of the size
of the error variance under the less restrictive assumption. Based on the
less restrictive assumption, these results suggest that instead of
approximately 780 examinees, we would need approximately 1,182 (780/.66)
examinees to obtain the precision implied by the error variance based on the
normality assumption, which is a substantial difference. The close agreement
between the bootstrap standard errors and the delta method standard errors
derived without the normality assumption, in combination with the findings from
the previously discussed nonnormal simulations shown in the RMSE column in
Table 4, suggests that, for this real data example, the sample size estimates
using the standard errors based on the normality assumption likely are not as
accurate as those based on the less restrictive assumption.
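
The sample size argument above rests on the large sample error variance being inversely proportional to the number of examinees. The following Python sketch retraces that arithmetic with the rounded figures used in the text. The variable names are hypothetical.

```python
# Delta method standard errors of equating at a raw score of 80 (from the text).
se_normal = 0.44    # derived under the normality assumption
se_general = 0.54   # derived without the normality assumption
n_current = 780     # approximate number of examinees per form, as in the text

# Error variances, rounded to two decimals as in the text.
var_normal = round(se_normal ** 2, 2)
var_general = round(se_general ** 2, 2)

# Relative size of the normal-theory error variance.
ratio = round(var_normal / var_general, 2)

# Error variance is proportional to 1/n in large samples, so shrinking the
# variance by this ratio requires dividing the sample size by it.
n_required = n_current / ratio

print(var_normal, var_general, ratio, round(n_required))
```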







Discussion 

The results of the computer simulation indicate that for Tucker equating,
the standard errors derived without the normality assumption are more accurate
than those derived with the normality assumption for sample sizes of 250 or
more examinees per test form. The results also indicate that the standard
errors derived with the normality assumption may be acceptable when test score
distributions are nearly symmetric, but these standard errors appear to be
inadequate for nonsymmetric distributions. The results of the real data
example indicate that the standard errors with the normality assumption may
suggest substantially more equating precision in the crucial range than is
actually the case.

In the real data example, the bootstrap standard errors are very similar
to the delta method standard errors derived without the normality assumption,
which are preferable to the bootstrap standard errors for reasons of cost and
ease of computation. Still, the results are encouraging for the use of the
bootstrap in equating contexts. Ultimately, the bootstrap may prove useful
for estimating standard errors of equating in complicated situations such as
in chains of dependent equipercentile equatings or in smoothed equipercentile
equating.
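
To make the bootstrap procedure concrete, the following Python sketch resamples examinees with replacement within each group, recomputes the Tucker equating for each replication, and takes the standard deviation of the equated scores across replications. It is a minimal sketch under stated assumptions: the formulas in `tucker_equate` are the Tucker synthetic population formulas as commonly stated in the equating literature and are assumed to agree with the equations given earlier in this paper; the data are synthetic; and all function and variable names are hypothetical.

```python
import numpy as np

def tucker_equate(x, xv1, yv2, w1=0.5):
    """Tucker linear equating of Form X scores x to the Form Y scale.

    xv1: (n1, 2) array of (X, V) scores from group 1.
    yv2: (n2, 2) array of (Y, V) scores from group 2.
    """
    w2 = 1.0 - w1
    X, V1 = xv1[:, 0], xv1[:, 1]
    Y, V2 = yv2[:, 0], yv2[:, 1]
    a1 = np.cov(X, V1, bias=True)[0, 1] / V1.var()  # slope of X on V, group 1
    a2 = np.cov(Y, V2, bias=True)[0, 1] / V2.var()  # slope of Y on V, group 2
    dmu = V1.mean() - V2.mean()
    dvar = V1.var() - V2.var()
    # Synthetic population means and variances.
    mu_x = X.mean() - w2 * a1 * dmu
    mu_y = Y.mean() + w1 * a2 * dmu
    var_x = X.var() - w2 * a1**2 * dvar + w1 * w2 * (a1 * dmu) ** 2
    var_y = Y.var() + w1 * a2**2 * dvar + w1 * w2 * (a2 * dmu) ** 2
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.sqrt(var_y / var_x) * (x - mu_x) + mu_y

def bootstrap_se(x, xv1, yv2, n_boot=500, w1=0.5, seed=0):
    """Bootstrap standard errors of the equated scores at each value in x."""
    rng = np.random.default_rng(seed)
    reps = np.empty((n_boot, np.size(x)))
    for b in range(n_boot):
        # Resample whole examinees (rows) independently within each group.
        s1 = xv1[rng.integers(0, len(xv1), size=len(xv1))]
        s2 = yv2[rng.integers(0, len(yv2), size=len(yv2))]
        reps[b] = tucker_equate(x, s1, s2, w1)
    return reps.std(axis=0, ddof=1)

# Synthetic data for illustration only (not the certification data).
_rng = np.random.default_rng(1)

def fake_group(n, v_mean):
    v = _rng.normal(v_mean, 4.0, size=n)
    total = 4.0 * v + _rng.normal(0.0, 6.0, size=n)
    return np.column_stack([total, v])

group1 = fake_group(500, 23.0)
group2 = fake_group(500, 22.5)
scores = np.array([50.0, 90.0])  # one score far from the mean, one near it
se = bootstrap_se(scores, group1, group2)
print(se.round(2))
```

Resampling rows keeps each examinee's total score and common item score together, which preserves the within-group covariance that the Tucker method depends on.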

The standard errors derived in this paper index equating error that is
due to examinee sampling. Error that results from a failure to meet the
assumptions required for Tucker equating is not reflected in the standard
errors. Braun and Holland (1982) suggest some procedures for checking these
assumptions, although they indicate that not all of the assumptions are
testable without collecting additional data. The assumptions required in
Tucker equating seem most reasonable for testing programs in which:
(i) examinee populations do not change much from test date to test date;
(ii) the content balance of the set of common items is very similar to the
content balance of the total test forms, and the total test forms are each
built to the same specifications; and (iii) the statistical characteristics of
the set of common items are very similar to the statistical characteristics of
the total test forms, and the total test forms are similar to one another in
statistical characteristics. Characteristics ii and iii are most readily
achieved for testing programs in which tests are constructed from a large pool
of items with item statistics that are accurately estimated from pretesting or
previous use.

One reason for deriving standard errors without the normality assumption
is that many testing programs produce score distributions which deviate
markedly from a normal distribution. For example, professional certification
testing programs often produce markedly negatively skewed score distributions
that result, in part, because the mean score on such examinations is often in
the range of 68% to 80% of the items correct. Many of the testing programs
that produce skewed distributions are equated using linear methods under the
common item nonrandom groups design with large sample sizes (250 or more
examinees per test form). For such programs, the results of the analyses in
this paper indicate that for Tucker equating, the standard errors derived
without the normality assumption are reasonably accurate and preferable to
those derived with the normality assumption.






References 

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L.
    Thorndike (Ed.), Educational Measurement (2nd ed.). Washington, D.C.:
    American Council on Education, 508-600.

Braun, H. I., & Holland, P. W. (1982). Observed-score test equating: A
    mathematical analysis of some ETS equating procedures. In P. W. Holland
    and D. B. Rubin (Eds.), Test Equating. New York: Academic Press, 9-49.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans.
    Philadelphia, PA: Society for Industrial and Applied Mathematics.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

International Mathematical and Statistical Libraries (1982). Reference Manual
    (9th ed.). Houston: Author.

Kendall, M., & Stuart, A. (1977). The advanced theory of statistics
    (Vol. 1). New York: Macmillan.

Lord, F. M. (1950). Notes on comparable scales for test scores (RB-50-48).
    Princeton, N.J.: Educational Testing Service.

Lord, F. M. (1955). A survey of observed test-score distributions with
    respect to skewness and kurtosis. Educational and Psychological
    Measurement, 15, 383-389.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test
    scores. Reading, MA: Addison-Wesley.






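TABLE 1

Partial Derivatives of Tucker Linear Equating Equation
With Respect to Each Sample Statistic and Evaluated at the Parameters

Statistic               Derivative Evaluated at Parameters

$\hat\mu_1(X)$          $-\sigma_s(Y)/\sigma_s(X)$

$\hat\mu_1(V)$          $w_2\sigma_s(Y)\alpha_1(X|V)/\sigma_s(X) + w_1 w_2 z_x \alpha_2^2(Y|V)[\mu_1(V) - \mu_2(V)]/\sigma_s(Y)$
                        $- w_1 w_2 \sigma_s(Y) z_x \alpha_1^2(X|V)[\mu_1(V) - \mu_2(V)]/\sigma_s^2(X) + w_1\alpha_2(Y|V)$

$\hat\sigma_1^2(X)$     $-z_x\sigma_s(Y)/[2\sigma_s^2(X)]$

$\hat\sigma_1^2(V)$     $-w_2\sigma_s(Y)\alpha_1(X|V)[\mu_1(V) - \mu_2(V)]/[\sigma_s(X)\sigma_1^2(V)]$
                        $+ w_1 z_x \alpha_2^2(Y|V)/[2\sigma_s(Y)]$
                        $- \sigma_s(Y) z_x \alpha_1^2(X|V)[1 + w_1 - 2\sigma_s^2(V)/\sigma_1^2(V)]/[2\sigma_s^2(X)]$

$\hat\sigma_1(X,V)$     $w_2\sigma_s(Y)[\mu_1(V) - \mu_2(V)]/[\sigma_s(X)\sigma_1^2(V)]$
                        $- \sigma_s(Y) z_x \alpha_1(X|V)[\sigma_s^2(V)/\sigma_1^2(V) - 1]/\sigma_s^2(X)$

$\hat\mu_2(Y)$          $1$

$\hat\mu_2(V)$          $-w_2\sigma_s(Y)\alpha_1(X|V)/\sigma_s(X) - w_1 w_2 z_x \alpha_2^2(Y|V)[\mu_1(V) - \mu_2(V)]/\sigma_s(Y)$
                        $+ w_1 w_2 \sigma_s(Y) z_x \alpha_1^2(X|V)[\mu_1(V) - \mu_2(V)]/\sigma_s^2(X) - w_1\alpha_2(Y|V)$

$\hat\sigma_2^2(Y)$     $z_x/[2\sigma_s(Y)]$

$\hat\sigma_2^2(V)$     $-w_2\sigma_s(Y) z_x \alpha_1^2(X|V)/[2\sigma_s^2(X)] - w_1\alpha_2(Y|V)[\mu_1(V) - \mu_2(V)]/\sigma_2^2(V)$
                        $+ z_x \alpha_2^2(Y|V)[1 + w_2 - 2\sigma_s^2(V)/\sigma_2^2(V)]/[2\sigma_s(Y)]$

$\hat\sigma_2(Y,V)$     $z_x \alpha_2(Y|V)[\sigma_s^2(V)/\sigma_2^2(V) - 1]/\sigma_s(Y)$
                        $+ w_1[\mu_1(V) - \mu_2(V)]/\sigma_2^2(V)$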
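
Entries of this kind can be checked numerically by comparing an analytic derivative with a central finite difference of the equating function. The Python sketch below does this for two entries, the derivatives with respect to the group 1 mean on X and the group 2 covariance of Y and V. It is a sketch under stated assumptions: the Tucker formulas are written in their commonly stated synthetic population form and are assumed to match the equations given earlier in this paper; the means and standard deviations are the nonsymmetric population values from Table 3; the correlation of 0.85 used to build the covariances is hypothetical; and all names are hypothetical.

```python
import math

def tucker(x, p, w1=0.5):
    """Tucker equated score and intermediate quantities from ten moments in p."""
    w2 = 1.0 - w1
    a1 = p["cov1_XV"] / p["var1_V"]
    a2 = p["cov2_YV"] / p["var2_V"]
    dmu = p["mu1_V"] - p["mu2_V"]
    # Synthetic population variance of the common items V.
    var_s_V = w1 * p["var1_V"] + w2 * p["var2_V"] + w1 * w2 * dmu**2
    mu_x = p["mu1_X"] - w2 * a1 * dmu
    mu_y = p["mu2_Y"] + w1 * a2 * dmu
    sd_x = math.sqrt(p["var1_X"] + a1**2 * (var_s_V - p["var1_V"]))
    sd_y = math.sqrt(p["var2_Y"] + a2**2 * (var_s_V - p["var2_V"]))
    z_x = (x - mu_x) / sd_x
    return {"l": sd_y * z_x + mu_y, "z_x": z_x, "sd_x": sd_x, "sd_y": sd_y,
            "a2": a2, "dmu": dmu, "var_s_V": var_s_V}

def numeric_derivative(x, p, key, h=1e-5):
    """Central finite difference of the equated score with respect to p[key]."""
    up, down = dict(p), dict(p)
    up[key] += h
    down[key] -= h
    return (tucker(x, up)["l"] - tucker(x, down)["l"]) / (2 * h)

r = 0.85  # hypothetical correlation between total score and common items
p = {"mu1_X": 97.22, "var1_X": 15.37**2, "mu1_V": 23.33, "var1_V": 3.94**2,
     "cov1_XV": r * 15.37 * 3.94,
     "mu2_Y": 93.75, "var2_Y": 15.72**2, "mu2_V": 22.50, "var2_V": 4.26**2,
     "cov2_YV": r * 15.72 * 4.26}

x, w1 = 80.0, 0.5
t = tucker(x, p, w1)

# Analytic derivatives for the two entries being checked.
analytic = {
    "mu1_X": -t["sd_y"] / t["sd_x"],
    "cov2_YV": (t["z_x"] * t["a2"] * (t["var_s_V"] / p["var2_V"] - 1.0) / t["sd_y"]
                + w1 * t["dmu"] / p["var2_V"]),
}
checks = {k: (numeric_derivative(x, p, k), v) for k, v in analytic.items()}
for key, (num, ana) in checks.items():
    print(f"{key}: numeric={num:.6f} analytic={ana:.6f}")
```

The same pattern extends to the remaining eight statistics by adding their analytic expressions to `analytic`.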





TABLE 2

Sampling Variances and Covariances of Bivariate Moments

                                                  Sampling Variance or                                        Sampling Variance or
Statistic(s)                                      Covariance: General                                         Covariance: Normal Distribution

$\mathrm{var}[\hat\mu(X)]$                        $\sigma^2(X)/n$                                             $\sigma^2(X)/n$

$\mathrm{var}[\hat\sigma^2(X)]$                   $\{E[X-\mu(X)]^4 - \sigma^4(X)\}/n$                         $2\sigma^4(X)/n$

$\mathrm{var}[\hat\sigma(X,Y)]$                   $\{E[X-\mu(X)]^2[Y-\mu(Y)]^2 - \sigma^2(X,Y)\}/n$           $[\sigma^2(X)\sigma^2(Y) + \sigma^2(X,Y)]/n$

$\mathrm{cov}[\hat\mu(X), \hat\mu(Y)]$            $\sigma(X,Y)/n$                                             $\sigma(X,Y)/n$

$\mathrm{cov}[\hat\mu(X), \hat\sigma^2(X)]$       $E[X-\mu(X)]^3/n$                                           $0$

$\mathrm{cov}[\hat\mu(X), \hat\sigma^2(Y)]$       $E[X-\mu(X)][Y-\mu(Y)]^2/n$                                 $0$

$\mathrm{cov}[\hat\mu(X), \hat\sigma(X,Y)]$       $E[X-\mu(X)]^2[Y-\mu(Y)]/n$                                 $0$

$\mathrm{cov}[\hat\sigma^2(X), \hat\sigma^2(Y)]$  $\{E[X-\mu(X)]^2[Y-\mu(Y)]^2 - \sigma^2(X)\sigma^2(Y)\}/n$  $2\sigma^2(X,Y)/n$

$\mathrm{cov}[\hat\sigma^2(X), \hat\sigma(X,Y)]$  $\{E[X-\mu(X)]^3[Y-\mu(Y)] - \sigma^2(X)\sigma(X,Y)\}/n$    $2\sigma(X,Y)\sigma^2(X)/n$

Note: The terms in the body of the table were adapted from Kendall and Stuart (1977, pp. 85, 245,
      246 and 250) and are typically based on large sample theory. Also, E refers to expected
      value.
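
As an illustration of how the general and normal distribution entries can differ, the Python sketch below evaluates the two large sample expressions for the sampling variance of a sample variance, {E[X - mu(X)]^4 - sigma^4(X)}/n and 2 sigma^4(X)/n. It uses the identity that the fourth central moment equals Pearson's kurtosis index times sigma^4. The inputs are the Form X, Group 1 values reported in Table 5 (standard deviation 13.38, kurtosis 3.91, n = 773), treated here as if they were parameters, which is an assumption made only for illustration. The function name is hypothetical.

```python
def var_of_sample_variance(sd, kurtosis, n):
    """Large sample variance of a sample variance: (general, normal-theory)."""
    sigma4 = sd ** 4
    # Fourth central moment is kurtosis * sigma^4, so the general expression
    # {E[X - mu]^4 - sigma^4}/n becomes sigma^4 * (kurtosis - 1) / n.
    general = sigma4 * (kurtosis - 1.0) / n
    # Under normality the kurtosis index is 3, which gives 2 * sigma^4 / n.
    normal = 2.0 * sigma4 / n
    return general, normal

general, normal = var_of_sample_variance(sd=13.38, kurtosis=3.91, n=773)
print(round(general, 2), round(normal, 2), round(general / normal, 3))
```

The ratio of the two expressions is (kurtosis - 1)/2, so the normal-theory expression understates this sampling variance whenever the kurtosis index exceeds 3.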






TABLE 3

Population Means, Standard Deviations, Skewness, and Kurtosis
for Simulated Observed Score Distributions

                        Number of              Standard
Variable   Population   Items        Mean      Deviation   Skewness   Kurtosis

Nonsymmetric

X          1            125          97.22     15.37       -0.66      3.24
Y          2            125          93.75     15.72       -0.60      3.09
V          1             30          23.33      3.94       -0.67      3.23
V          2             30          22.50      4.26       -0.60      3.09

Nearly Symmetric

X          1             52          25.57      7.95        0.02      2.60
Y          2             52          26.59      8.37       -0.02      2.55
V          1             15           7.39      2.78        0.02      2.55
V          2             15           7.64      2.88       -0.02      2.51

Note: Skewness is Pearson's $\sqrt{\beta_1}$, and kurtosis is Pearson's $\beta_2$ index.






TABLE 4

Standard Errors of Tucker Equating for Two Simulated Tests
and at Two Sample Sizes

                                                    RMSE in Estimating
Score on           Standard Error                   Standard Error
Form X      "True"        Nonnormal   Normal        Nonnormal   Normal

Nonsymmetric, $n_1 = n_2 = 250$

120         1.01          0.96        1.12          .08         .13
110         0.70          0.68        0.81          .04         .12
100         [illegible]   0.59        0.63          .02         .06
 90         0.75          0.78        0.69          .05         .07
 80         1.09          1.10        0.94          .08         .15
 70         1.48          1.47        1.28          .11         .21
 60         1.89          1.87        1.65          .15         .26
 50         2.32          2.27        2.03          .19         .31

Nonsymmetric, $n_1 = n_2 = 100$

120         1.55          1.49        1.76          .16         .26
110         1.07          1.06        1.27          .09         .22
100         0.94          0.93        0.99          .06         .08
 90         1.28          1.21        1.08          .14         [illegible]
 80         1.85          1.71        1.48          .25         .39
 70         2.49          2.28        2.00          .36         .52
 60         3.16          2.89        2.58          .47         .63
 50         3.84          3.51        3.19          .57         .72

Nearly Symmetric, $n_1 = n_2 = 250$

 50         1.12          1.12        1.20          .07         .10
 40         0.73          0.74        0.78          .05         .06
 30         0.44          0.45        0.45          .02         .02
 20         0.46          0.46        0.48          .02         .03
 10         0.78          0.77        0.82          .05         .06
  0         1.16          1.15        1.25          .07         .11

Nearly Symmetric, $n_1 = n_2 = 100$

 50         1.82          1.77        1.89          .20         .17
 40         1.26          1.16        1.23          .13         .11
 30         0.73          0.70        0.71          .06         .05
 20         0.77          0.73        0.75          .07         .06
 10         1.27          1.22        1.31          .13         .11
  0         1.90          1.83        1.98          .20         .18






TABLE 5

Raw Score Summary Statistics for Forms X and Y and Common Items V
for a Professional Certification Program

                               Standard
Variable   Group     Mean      Deviation   Skewness   Kurtosis

X          1         95.75     13.38       -1.03      3.91
Y          2         96.84     13.37       -1.00      3.89
V          1         23.18      4.05       -0.84      3.48
V          2         22.54      4.31       -0.79      3.47

Note: Skewness is Pearson's $\sqrt{\beta_1}$, and kurtosis is Pearson's $\beta_2$ index.
      Sample sizes are 773 and 795 for Groups 1 and 2, respectively.
      There are 125 items on X and Y and 30 items on V.






TABLE 6

Standard Errors of Tucker Equating
for a Professional Certification Program

                                                          Standard Errors
Form X       Percentile Rank   Form Y
Raw Score    in Group 1        Equivalent     Normality   Nonnormality   Bootstrap¹

125          100.0             126.5          0.71        0.67           0.69
120           99.9             121.7          0.61        0.56           0.58
115           98.0             116.9          0.53        0.47           0.48
110           90.1             112.2          0.44        0.38           0.39
105           73.4             107.4          0.38        0.32           0.32
100           54.7             102.7          0.33        0.29           0.28
 95           40.2              97.9          0.31        0.30           0.30
 90           27.3              93.1          0.32        0.36           0.35
 85           13.6              88.4          0.37        0.44           0.44
 80           12.1              83.6          0.44        0.54           0.53
 75            8.9              78.9          0.52        0.64           0.64
 70            6.9              74.1          0.61        0.74           0.75
 65            3.8              69.4          0.70        0.85           0.86
 60            2.3              64.6          0.80        0.96           0.97
 55            0.6              59.8          0.90        1.08           1.08
 50            0.0              55.1          1.00        1.19           1.20

¹ Based on B = 1000 bootstrap replications.






